1. INTRODUCTION

The increasing use of internet of things (IoT) technology has significantly increased the amount of data
generated daily. Apart from providing benefits and facilities that improve the quality of life, big data and
the IoT pose vulnerability and security challenges [1]-[4]. Vulnerability has increased
due to the large number of physical infrastructures connected to the network [5], [6]. To address this issue, the
network intrusion detection system (NIDS) is an appropriate alternative for securing modern networks [7].

Security solutions for NIDS-based IoT have been proposed, such as machine learning and deep learning-based
methods [7]-[10]. Deep learning (DL) has performed impressively in big data domains [11], including
computer vision [12], speech recognition [13], and medical imaging [14]-[16]. Also, some studies have
used DL algorithms in intrusion detection systems, such as deep Boltzmann machines [17], deep belief
networks [18], and deep neural networks [19], [20]. DL techniques, alone or in combination, improve
accuracy and detection speed [21]. Furthermore, they efficiently detect attack variants and patterns [22]. There
are several challenges for NIDS in efficiently and effectively detecting abnormal network behavior. First, the
big data nature of IoT applications presents difficulties with the amount and complexity of data [1]. Many
applications encrypt Internet traffic using security protocols, and encryption encourages more diverse and
sophisticated attacks that need to be detected. Second, in real network traffic, the data distribution is usually
imbalanced [23]: the number of samples belonging to regular traffic is much higher than the number of attack
samples. Therefore, the classification results lean towards the benign data [24].

Most of the datasets used to analyze NIDS, such as the network security layer-knowledge discovery in
database (NSL-KDD) dataset [25] and the communications security establishment and Canadian institute for
cybersecurity intrusion detection system 2018 dataset (CSE-CIC-IDS2018) [26], have a severe class imbalance
problem. Several approaches have been employed to solve this problem, such as over-sampling [23], [27], [28],
under-sampling [29], spread sub-sample [30], and class balancer [23]. However, class imbalance among
attack classes remains an open issue. Also, data quality issues trigger
imbalanced data problems [31], [32]. Therefore, other strategies are needed, especially
for multi-class cases. Focal loss has been developed and applied in image recognition, biomedical
sciences, and stability training [33]-[35]. Building on this work, this study uses an improved focal loss function
for a multi-class model to mitigate the effects of class imbalance and over-fitting in attack classification. The
research focuses on efficient training on the full data set under extreme class imbalance, with multi-class
attack classification trained using the focal loss function in the deep learning models.

This study proposes a multi-class focal loss function for deep learning to address imbalanced data. The
result is compared with the cross-entropy (CE) loss and weighted cross-entropy functions. As a contribution,
this study proposes a deep auto-encoder (DAE) combined with a deep neural network (DNN) model using a
multi-class focal loss function, aimed at addressing the class imbalance in attack
classification. Experimental results show that the deep auto-encoder pre-training stage is advantageous
because it learns more complicated features from the original data. The focal loss function, scaled from the
cross-entropy loss, is a more effective alternative to previous approaches in dealing with class imbalance in
multi-class attack classification.


2. RELATED WORK 

Network intrusion detection systems have been studied widely over the past several years. This section
briefly discusses some published deep learning approaches, in particular those dealing with imbalanced
datasets. In 2019, Lin et al. [26] used deep learning for dynamic network anomaly detection. The synthetic
minority oversampling technique (SMOTE) algorithm was applied to handle the imbalanced class
problem in the CSE-CIC-IDS2018 dataset. As a classifier, a deep neural network model based on long
short-term memory (LSTM) was combined with an attention mechanism (AM) to enhance performance.
Applying SMOTE to raise the proportion of the minority classes improved the deep learning model.
The model achieved its best results with an accuracy of 96.2% and a recall rate of 98% for the 6-category
classification.

More recently, Zhang et al. [27] introduced a hybrid SMOTE that combines SMOTE and Gaussian
mixture model (GMM) based clustering to improve the detection rate of the minority class. The combined
SMOTE and Gaussian mixture (SGM) processing was integrated with a
convolutional neural network (CNN) for binary and multi-class classification. They claimed that the SGM
model increases detection and reduces the time cost. The proposed method was evaluated with 5
class-imbalance processing techniques and 2 classification algorithms, using the University of New South
Wales-NB 2015 (UNSW-NB15) and the Canadian institute for cybersecurity intrusion detection system 2017
(CICIDS2017) datasets. The evaluation on the CICIDS2017 dataset shows that the method achieves an
excellent detection rate of 99.85% in the 15-class classification. However, the detection rates for web attack
brute force are still less than 50%, lower than random oversampling (ROS) and SMOTE. As for the
UNSW-NB15 dataset, the detection rates for binary and 10-class classification reach 99.74% and 96.54%,
respectively.

Abdulhammed et al. [23] used various techniques, such as over-sampling, under-sampling, spread
subsample, and class balancer, to solve imbalanced data problems for binary classes. Several classifiers, such
as random forest (RF), DNN, voting, and variational auto-encoder, were used in the evaluation. The experiments
on the CIDDS-001 dataset showed that DNN with the down-sampling method and class balancer is the most
effective. According to the experimental results, the class distribution has only a slight impact on the
classification process. Furthermore, Abdulhammed et al. [36] proposed uniform distribution based balancing
(UDBB) for imbalanced classes. To reduce features, the auto-encoder (AE) and principal component analysis
(PCA) were used in evaluation with various classifier methods. The simulation results on the original
distribution of the CICIDS2017 dataset showed that PCA produces better accuracy than AE, at 99.6%. However,
with UDBB, the detection accuracy was reduced to 98.9%, although some attacks were detected better.
In another experiment, Hua [29] used under-sampling and feature selection in pre-processing and proposed
traffic classification using LightGBM on the CSE-CIC-IDS2018 dataset. The model used only 10
features, selected using random forest. The model was compared with various machine learning
algorithms and a CNN deep learning model. The best overall accuracy obtained reached 98.37%.
However, the influence of the model on the minority class was not discussed.

Yang et al. [37] applied an improved conditional variational auto-encoder (ICVAE) with a deep neural network
in NIDS. ICVAE training explores the relationship
between data features and attack classes, with the aim of balancing the training data set and improving
detection performance on minority attacks. They used cross-entropy as the reconstruction loss function of the
decoder. The results showed that the best model obtains up to 89.08% and 85.97%
in multi-class classification on the UNSW-NB15 and NSL-KDD datasets, respectively. They claimed that
ICVAE-DNN increases the detection rates of minority and unknown attacks. Also, an unsupervised auto-encoder
was used by Li et al. [38] to overcome imbalance problems in NIDS. They used random forest to select
significant features in the CSE-CIC-IDS2018 dataset and performed anomaly detection for each attack.
However, the results of AE-IDS for web attacks (SQL injection, brute force web, and brute
force-XSS) are still low and need to be optimized. Similarly, an unsupervised auto-encoder model was used by
Zhao et al. [39], who introduced the semi-supervised discriminant auto-encoder (SSDA) to handle new
attacks. Inspired by existing research, this study uses a DAE to extract features from attack data. Furthermore,
the focal loss is used to increase the detection rate of minority attacks. The CSE-CIC-IDS2018 dataset is used
to test the model in multi-class classification and to compare the impact of the three loss functions on
imbalanced data.


3. RESEARCH METHOD 

This study improves the ability of intrusion detection systems to detect minority attack classes using deep
learning models with a deep auto-encoder (DAE) pre-tuning process and fine-tuning using a DNN. The
classification process used 3 scenarios: categorical cross-entropy (CE) loss, focal loss (FL), and
weighted categorical cross-entropy (WCE), as illustrated in Figure 1. The model was evaluated using
CSE-CIC-IDS2018, which represents a recent attack dataset [40].

Figure 1. Deep learning architecture-based attacks classification


3.1. Dataset 

The CSE-CIC-IDS2018 dataset consists of 80 features, including labels. The features of the dataset
are generated and extracted with CICFlowMeter [40], [41]. The designed scenario consists of 6 attack
categories: denial of service (DoS), distributed denial of service (DDoS), botnet, brute-force, web attacks, and
infiltration, as presented in Table 1. They are grouped into 14 attack sub-classes. The total data amount is
16,232,943 records, dominated by benign traffic at 83.07%. This study used about 10% of the data for training
and 2.5% for testing. Benign traffic makes up about 51% of the sampled data, and the malicious classes were
sampled according to the amount of data available in each class. Table 1 gives the composition of the training
and testing data used. Infiltration is a stealthy attack that uses the internal network for illegal access. The
characteristics of infiltration and benign traffic are very close, which makes infiltration difficult for a network
IDS to detect [42]. The infiltration class was therefore excluded from the experiment, because this study
focuses on the effect of class imbalance on detection accuracy.


Deep learning with focal loss approach for attacks classification (Yesi Novaria Kunang) 


ISSN: 1693-6930


Table 1. Composition of the CSE-CIC-IDS2018 dataset used

Category       Size         Size (%)   Train       Train (%)   Test      Test (%)
Benign         13,484,708   83.07%     803,025     51.12%      201,238   51.21%
Bot            286,191      1.76%      85,842      5.46%       21,479    5.47%
BruteForce     380,949      2.35%      114,387     7.28%       28,465    7.24%
DDoS           654,300      4.03%      354,907     22.59%      88,593    22.59%
DoS            1,263,933    7.19%      212,053     13.50%      52,995    13.49%
Web Attacks    928          0.01%      754         0.05%       174       0.04%
Infiltration   161,934      1.00%      -           -           -         -
Total          16,232,943   100.00%    1,570,968   100.00%     392,944   100.00%
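
The degree of imbalance in the training split can be checked directly from the counts in Table 1. The short sketch below recomputes the class shares and the ratio between the largest and smallest class; the variable names are illustrative and not part of the original study.

```python
# Training-set record counts copied from Table 1.
train_counts = {
    "Benign": 803_025,
    "Bot": 85_842,
    "BruteForce": 114_387,
    "DDoS": 354_907,
    "DoS": 212_053,
    "Web Attacks": 754,
}

total = sum(train_counts.values())

# Share of each class in the training split, in percent.
shares = {name: 100.0 * n / total for name, n in train_counts.items()}

# Ratio between the largest and the smallest class.
imbalance_ratio = max(train_counts.values()) / min(train_counts.values())

for name, pct in shares.items():
    print(f"{name:12s} {pct:6.2f}%")
print(f"largest/smallest class ratio: {imbalance_ratio:.0f}:1")
```

Even after benign traffic is reduced to about half of the sample, the benign class is roughly three orders of magnitude larger than the web attack class.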


3.2. Pre-processing 

From the 80 features of the CSE-CIC-IDS2018 dataset [40], the timestamp feature was eliminated, and
only 79 were used. The timestamp encodes when an attack occurred. It is
essential for time-series prediction. However, it is not essential for classification, where the model must
recognize the attack based on its characteristics. The flow duration feature has more impact on identifying
attacks such as DDoS and DoS, due to their rapid nature. In a detection and classification model, the time of
occurrence is not necessary because, in practice, an attack can happen at any time. Therefore,
the feature is eliminated, as in previous studies [26], [27], [38].

The first stage of dataset pre-processing is feature encoding, which transforms the data from
categorical into numerical. The feature encoding process, using one-hot encoding, changed the protocol and
label features into numeric data. The process mapped the protocol feature to 3 instances: transmission
control protocol (TCP), user datagram protocol (UDP), and Hop-by-Hop IPv6 (HOPOPT). Also, the label
feature became 6 class indicators after eliminating the infiltration class. Finally, 80 data features
and 6 label features were obtained. The next stage was feature scaling, which maps all data values into
a specified range. This process is necessary so that features with large values do not dominate the others.
Feature scaling uses min-max scaling to the range [0, 1], as in a previous study [43]. After
pre-processing, the data is ready for the training and testing process.
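
The two pre-processing steps can be sketched in a few lines of plain Python. This is a minimal illustration on made-up toy values, not the authors' pipeline; the helper names `one_hot` and `min_max_scale` are hypothetical, and mapping a constant column to zero is an assumption.

```python
def one_hot(values, categories):
    """Map each categorical value to a 0/1 indicator vector."""
    index = {c: i for i, c in enumerate(categories)}
    rows = []
    for v in values:
        row = [0.0] * len(categories)
        row[index[v]] = 1.0
        rows.append(row)
    return rows


def min_max_scale(column):
    """Rescale a numeric column to [0, 1]; constant columns map to 0 (assumption)."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


# Toy records; the values are made up for illustration.
protocols = ["TCP", "UDP", "TCP", "HOPOPT"]
flow_duration = [120.0, 5.0, 60.0, 0.0]

encoded = one_hot(protocols, ["TCP", "UDP", "HOPOPT"])
scaled = min_max_scale(flow_duration)
```

In practice the minimum and maximum would be computed on the training split and reused for the test split, so that both are scaled consistently.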


3.3. Deep learning architecture 

The proposed model of the intrusion detection system is designed using a pre-tuning and a fine-tuning
process. The deep learning architecture uses automatic feature extraction with a deep auto-encoder (DAE) in the
pre-tuning stage and a deep neural network (DNN) architecture in the fine-tuning stage. The DAE performs
feature extraction with an encoding and a decoding phase. An auto-encoder generates an output $\hat{x}$, which
is reconstructed from the input $x$. In a single-layer auto-encoder with a real-valued input vector $x$, the
encoding function $h$, in forward propagation for hidden layer $l$ ($l = 1$), is given in (1). The decoding
function is given in (2):


$h = E(x) = f(W^{(1)} x + b^{(1)})$ (1)
$\hat{x} = D(h) = g(W^{(2)} h + b^{(2)})$ (2)


where $W$ are weight matrices, $b$ are bias vectors, $x$ is an input vector from the dataset, and $f(\cdot)$ and
$g(\cdot)$ are the activation functions. The experiment used several variants of the ReLU activation function,
such as SeLU, PReLU, ELU, and Leaky ReLU, in the hidden layers and sigmoid in the last layer of the auto-encoder.

The deep auto-encoder (DAE) extends the single AE with a higher number of layers. The
function compositions in the encoder and decoder are $E$ and $D$, respectively. The proposed architecture uses 7
hidden layers, with the output reconstruction $\hat{x} = D_1(D_2(D_3(E_3(E_2(E_1(x))))))$. The reconstruction
of the input $x$ into the output $\hat{x}$ uses the MSE loss function in (3). The backpropagation process drives
the loss value close to zero.


$J(W, b; x^{i}, \hat{x}^{i}) = \frac{1}{2} \| x^{i} - \hat{x}^{i} \|^{2}$ (3)
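
To make the encode/decode composition concrete, the sketch below runs one forward pass through an untrained auto-encoder in plain Python. It is an illustration under stated assumptions, not the authors' Keras model: the layer widths are the ones reported for the best architecture in Section 4.1, the weights are random (LeCun-uniform) rather than trained, the Leaky ReLU slope of 0.01 is assumed, and the loss is taken as half the squared reconstruction error.

```python
import math
import random

# Layer widths of the 7-hidden-layer DAE reported in Section 4.1.
SIZES = [80, 70, 40, 30, 25, 30, 40, 70, 80]


def leaky_relu(v, slope=0.01):  # slope value is an assumption
    return v if v > 0 else slope * v


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def dense(x, W, b, act):
    """One layer, act(W x + b), with W stored as a list of rows."""
    return [act(sum(w * xi for w, xi in zip(row, x)) + bi)
            for row, bi in zip(W, b)]


def init_layers(sizes, seed=0):
    """Untrained weights drawn LeCun-uniform: U(-sqrt(3/fan_in), sqrt(3/fan_in))."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        limit = math.sqrt(3.0 / n_in)
        W = [[rng.uniform(-limit, limit) for _ in range(n_in)]
             for _ in range(n_out)]
        layers.append((W, [0.0] * n_out))
    return layers


def forward(x, layers):
    """Return the bottleneck code z and the reconstruction x_hat."""
    h, z = x, None
    bottleneck = len(layers) // 2 - 1  # index of the layer that outputs z
    for k, (W, b) in enumerate(layers):
        is_last = k == len(layers) - 1
        h = dense(h, W, b, sigmoid if is_last else leaky_relu)
        if k == bottleneck:
            z = h
    return z, h


def reconstruction_loss(x, x_hat):
    """Half the squared reconstruction error for one sample."""
    return 0.5 * sum((a - b) ** 2 for a, b in zip(x, x_hat))


rng = random.Random(1)
x = [rng.random() for _ in range(SIZES[0])]  # one synthetic record scaled to [0, 1]
z, x_hat = forward(x, init_layers(SIZES))
loss = reconstruction_loss(x, x_hat)
```

Training would adjust the weights by backpropagation to reduce `loss`; the 25-dimensional code `z` is what the classifier later consumes.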

A bottleneck layer (middle layer) with a smaller dimension than the input represents the extracted
features $z = E_3 = f(W^{(3)} E_2 + b^{(3)})$. The result of the feature extraction, in the form of the encoding
structure with its weight and bias values, is transferred to the deep neural network (DNN) structure for the
fine-tuning training process. The encoding result of this structure becomes the input vector of the DNN
classifier architecture. The DNN model uses the weight, bias, and $z$ values of the AE pre-training to produce
the output class prediction $\hat{y}$, written as:

$\hat{y} = s(W^{(l)} z^{(l)} + b^{(l)})$ (4)

The output $\hat{y}$ (predicted target label) is brought close to $y$ (target label vector) by using the softmax
activation function $s(\cdot)$ in the last layer. For all training samples $(x^{i}, y^{i})$, the loss function is
given by:


$J(W, b) = \frac{1}{m} \sum_{i=1}^{m} L(\hat{y}^{i}, y^{i}) = \frac{1}{m} \sum_{i=1}^{m} J(W, b; x^{i}, y^{i})$ (5)


where $L$ is the loss function and $m$ is the number of training samples. One focus of this study is to compare
the loss function $L$ using cross-entropy, weighted cross-entropy, and focal loss.
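
The transfer step and the averaged loss in (5) can be sketched as follows. This is a plain-Python illustration, not the authors' implementation: the "pre-trained" encoder is a random stand-in, the widths 80-70-40-30-25-6 follow the architecture reported in Section 4.1, and all function names are hypothetical.

```python
import math
import random


def affine(x, W, b):
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def leaky_relu(v, slope=0.01):  # slope value is an assumption
    return [a if a > 0 else slope * a for a in v]


def softmax(v):
    m = max(v)  # subtract the maximum for numerical stability
    e = [math.exp(a - m) for a in v]
    s = sum(e)
    return [a / s for a in e]


def random_layer(n_in, n_out, rng):
    limit = math.sqrt(3.0 / n_in)
    W = [[rng.uniform(-limit, limit) for _ in range(n_in)] for _ in range(n_out)]
    return W, [0.0] * n_out


rng = random.Random(0)
enc_sizes = [80, 70, 40, 30, 25]
# Stand-in for the encoder half of a pre-trained DAE.
encoder = [random_layer(a, b, rng) for a, b in zip(enc_sizes[:-1], enc_sizes[1:])]

# Transfer: copy the encoder weights and biases, then add a new 6-class softmax layer.
classifier = [([row[:] for row in W], b[:]) for W, b in encoder]
classifier.append(random_layer(enc_sizes[-1], 6, rng))


def predict(x, layers):
    h = x
    for W, b in layers[:-1]:
        h = leaky_relu(affine(h, W, b))
    W, b = layers[-1]
    return softmax(affine(h, W, b))


def cross_entropy(y_hat, y):
    return -sum(t * math.log(p) for t, p in zip(y, y_hat) if t > 0)


def mean_loss(samples, layers, loss_fn):
    """Average the per-sample loss over the m samples, as in (5)."""
    return sum(loss_fn(predict(x, layers), y) for x, y in samples) / len(samples)


# Two synthetic samples with one-hot labels.
samples = [
    ([rng.random() for _ in range(80)], [1, 0, 0, 0, 0, 0]),
    ([rng.random() for _ in range(80)], [0, 0, 0, 0, 0, 1]),
]
y_hat = predict(samples[0][0], classifier)
avg = mean_loss(samples, classifier, cross_entropy)
```

Because the loss is passed in as `loss_fn`, the same training objective can be evaluated with cross-entropy, weighted cross-entropy, or focal loss without changing the network.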

The DNN model carried out the training process for classifying multi-class attacks. A hyperparameter
tuning process was performed to get the best deep learning model by looking at the detection rate of the
attack classification. This tuning process tried model variants that differ in the number of hidden layers,
the number of nodes, learning rate, batch size, activation function, and kernel initialization.


3.4. Loss function 
In the multi-class case with $C > 2$ classes, the loss function for
the categorical cross-entropy (CE) is:


$L(\hat{y}, y) = CE = -\sum_{i=1}^{C} y^{i} \log(\hat{y}^{i})$ (6)


where $C$ is the number of classes, $y^{i}$ is the ground truth for class $i$, and $\hat{y}^{i} \in [0, 1]$ is the
model's predicted probability for that class. Here $y^{i} = 1$ if $i$ is the actual label; otherwise, it equals 0.

For imbalanced class cases, the CE loss function is modified by adding a weighting factor [44] to
obtain the weighted CE shown in (7):


$\text{Weighted CE} = -\sum_{i=1}^{C} \alpha^{i} y^{i} \log(\hat{y}^{i})$ (7)


where $\alpha^{i}$ is the weight factor for class $i$.
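
A per-sample sketch of (6) and (7) in plain Python is given below. The function names, the clipping constant `eps`, and the example weights are illustrative assumptions rather than values from the study.

```python
import math


def categorical_ce(y_true, y_pred, eps=1e-12):
    """Categorical cross-entropy for one sample with a one-hot label, as in (6)."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))


def weighted_ce(y_true, y_pred, alpha, eps=1e-12):
    """Class-weighted cross-entropy, as in (7); alpha holds one weight per class."""
    return -sum(a * t * math.log(max(p, eps))
                for a, t, p in zip(alpha, y_true, y_pred))


y_true = [0, 0, 1]           # the sample belongs to the third (rare) class
y_pred = [0.2, 0.1, 0.7]     # softmax output of a hypothetical model
alpha = [0.25, 0.25, 2.0]    # hypothetical class weights, larger for the rare class

ce = categorical_ce(y_true, y_pred)
wce = weighted_ce(y_true, y_pred, alpha)
```

With a one-hot label only the true class contributes, so the weighted loss is simply the CE loss multiplied by the weight of the true class.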

The deficiency of the CE loss is that the many easily classified samples of the majority classes accumulate
into a loss value that overwhelms the rare class [33], [34]. Therefore, to address the extreme imbalance in
multi-class attack classification, this study takes advantage of the focal loss function proposed by
Lin et al. [33]. The focal loss function does not give the same weight to all training data. Instead, it
reduces the weight of well-classified data, so that training emphasizes data that is difficult
to classify, as shown in (8):


$L_{FL}(\hat{y}, y) = -\sum_{i=1}^{C} \alpha (1 - \hat{y}^{i})^{\gamma} y^{i} \log(\hat{y}^{i})$ (8)


with $\gamma$ as a modulating factor that reduces the weight of well-classified samples. When $\gamma = 0$, the
loss reduces to the ($\alpha$-weighted) cross-entropy. Therefore, $\gamma \geq 0$ is set to evaluate the effect of
down-weighting well-classified samples. The parameter $\alpha$ is the weight that balances the focal loss, and it
increases the accuracy for the imbalanced class.
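
The focal loss in (8) can be written per sample as in the sketch below. It assumes a single scalar weight for all classes, which matches how the weight is tuned in Section 4, and a hypothetical clipping constant `eps`; it is not the authors' Keras code.

```python
import math


def focal_loss(y_true, y_pred, gamma=1.0, alpha=0.5, eps=1e-12):
    """Multi-class focal loss for one sample with a one-hot label, as in (8)."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0)  # keep log() finite
        total += -alpha * (1.0 - p) ** gamma * t * math.log(p)
    return total


def cross_entropy(y_true, y_pred):
    return -sum(t * math.log(p) for t, p in zip(y_true, y_pred) if t > 0)


y_true = [0, 0, 1]
easy = [0.05, 0.05, 0.90]   # well-classified sample
hard = [0.45, 0.45, 0.10]   # poorly classified sample

# Fraction of the cross-entropy loss that survives the modulating factor (alpha = 1).
keep_easy = focal_loss(y_true, easy, gamma=2.0, alpha=1.0) / cross_entropy(y_true, easy)
keep_hard = focal_loss(y_true, hard, gamma=2.0, alpha=1.0) / cross_entropy(y_true, hard)
```

With these example probabilities the easy sample keeps about 1% of its cross-entropy loss, while the hard sample keeps about 81%, which is how the loss shifts attention towards difficult samples.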


3.5. Experimental setup and performance metrics 

The experiment was run on a cloud machine on the Google Colaboratory platform. The model was
developed in the Python programming language using Keras [45], a deep learning framework, with the
TensorFlow-GPU library for computation. The hyperparameter tuning process used the Talos library [46].

This study used accuracy, sensitivity, and specificity to measure the performance of the
proposed model. Accuracy assesses the model's ability to classify
attacks correctly. With imbalanced datasets, however, the predicted result is dominated by the majority
classes. Therefore, it is necessary to examine the model's specificity and sensitivity in the imbalanced data
set case [47]. Sensitivity shows the proportion of actual attacks of a class that the model detects. Specificity
shows the proportion of samples outside an attack class that the model correctly does not assign to it.
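
The three metrics can be computed from a multi-class confusion matrix in a one-vs-rest fashion, as sketched below on a made-up 3-class matrix. The support-weighted averaging used for the overall figures is an assumption; it would be consistent with the identical overall accuracy and sensitivity values reported in Section 4, because the support-weighted mean of per-class sensitivities equals the accuracy.

```python
def per_class_metrics(cm):
    """cm[i][j] = number of samples of true class i predicted as class j."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    out = []
    for k in range(n):
        tp = cm[k][k]
        fn = sum(cm[k]) - tp
        fp = sum(cm[i][k] for i in range(n)) - tp
        tn = total - tp - fn - fp
        out.append({
            "sensitivity": tp / (tp + fn),
            "specificity": tn / (tn + fp),
            "support": tp + fn,
        })
    return out


def accuracy(cm):
    total = sum(sum(row) for row in cm)
    return sum(cm[k][k] for k in range(len(cm))) / total


def weighted_average(metrics, key):
    """Average a per-class metric weighted by class support (assumed averaging)."""
    total = sum(m["support"] for m in metrics)
    return sum(m[key] * m["support"] for m in metrics) / total


# Made-up confusion matrix: one large class and two small ones.
cm = [
    [90, 5, 5],
    [2, 8, 0],
    [1, 0, 9],
]
metrics = per_class_metrics(cm)
overall_acc = accuracy(cm)
overall_sens = weighted_average(metrics, "sensitivity")
overall_spec = weighted_average(metrics, "specificity")
```

Per-class values remain the more informative view for rare classes, since a weighted average is dominated by the large classes.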


4. RESULTS AND ANALYSIS 
4.1. Hyper-parameter tuning 

The hyper-parameter tuning process is crucial in obtaining the network architecture (number of neurons
and layers) and the most appropriate hyper-parameter values for the deep
learning model. In the initial phase, tuning was performed over the number of hidden layers
and nodes, batch size, learning rate, activation function, and kernel initializer of the deep learning model
(DAE-DNN). The experiments used categorical cross-entropy as the loss function. The best architecture
obtained for the CE loss on the CSE-CIC-IDS2018 dataset is a DAE structure with 7 hidden layers
(80-70-40-30-25-30-40-70-80). Its encoder was extracted and transferred to the DNN model to become
80-70-40-30-25-6. This architecture was obtained by trying different numbers of DAE hidden layers
[1, 3, 5, and 7]. The best learning rate was 0.001, selected from the values [0.00001, 0.0001, 0.001, 0.01, 0.1].
The experiment used the lecun_uniform kernel initializer and the Leaky ReLU activation function, which
was selected through the tuning process over various activation functions. The batch size
of the best model is 256, obtained through tuning over the batch sizes
32, 64, and 256.
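
The candidate values listed above define a small search grid. The sketch below only enumerates that grid with the standard library; it does not reproduce the Talos API used in the study, and the dictionary keys are illustrative names. Whether every combination was actually evaluated is not stated in the text.

```python
from itertools import product

# Candidate values listed in the text; the key names are illustrative.
search_space = {
    "hidden_layers": [1, 3, 5, 7],
    "learning_rate": [0.00001, 0.0001, 0.001, 0.01, 0.1],
    "batch_size": [32, 64, 256],
}


def grid(space):
    """Yield every combination of the candidate values as a dict."""
    keys = list(space)
    for values in product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))


configs = list(grid(search_space))

# Configuration reported as best for the CE loss.
best = {"hidden_layers": 7, "learning_rate": 0.001, "batch_size": 256}
```

Each configuration would be trained and scored on validation data, and activation function and kernel initializer could be added as further keys.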

After obtaining the best model with the CE loss function as the basis of comparison, the focal
loss parameters were tuned. Two parameters were tuned in the multi-class focal loss: $\gamma$, which sets
the strength of the modulating factor, and $\alpha$, which is the weight factor for the class. The focal loss
parameters were tuned in the ranges $\gamma \in [0, 5]$ and $\alpha \in [0, 1]$, as recommended in [33].


4.2. Results of various focal loss parameters

The effectiveness of the focal loss function in attack classification was measured by taking the best
tuning result for the focal loss parameters. Training was performed for 30 epochs.
Table 2 summarizes the overall hyper-parameter tuning results for the focal loss function with various values
of $\gamma$ and $\alpha$. The weight $\alpha$ assigned to the rare class has a stable range. However, it interacts
with $\gamma$, making it necessary to select the two parameters together, as shown in Tables 2 (a) and 2 (b). In
general, the best $\alpha$ shifted only slightly as $\gamma$ varied. In this case, $\alpha = 0.5$ works best with
$\gamma = 1$. The best results are an accuracy of
98.223%, a sensitivity of 98.223%, and a specificity of 99.814% over all attack classes.

The proposed model reached the highest accuracy at $\gamma = 1$. This is reasonable because $\gamma$ minimizes
the loss contribution of dominant-class samples that are easily classified. When $\gamma$ increases, the
modulating factor $(1 - \hat{y}^{i})^{\gamma}$ of correctly classified samples decreases, which increases the
relative weight of minority class samples that are difficult to classify. With a $\gamma$ that is too large, the
model focuses mainly on the samples that are difficult to classify, which lowers the overall classification
accuracy.
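
A small worked example shows how strongly the modulating factor separates easy and hard samples as the focusing parameter grows. The two predicted probabilities (0.9 for an easy sample, 0.1 for a hard one) are chosen for illustration only.

```python
def modulating_factor(p, gamma):
    """Factor (1 - p)^gamma applied to a sample whose true class has probability p."""
    return (1.0 - p) ** gamma


rows = []
for gamma in (0, 0.5, 1, 2, 5):
    easy = modulating_factor(0.9, gamma)   # well-classified sample
    hard = modulating_factor(0.1, gamma)   # poorly classified sample
    rows.append((gamma, easy, hard, hard / easy))

for gamma, easy, hard, ratio in rows:
    print(f"gamma={gamma:<4} easy={easy:.5f} hard={hard:.5f} hard/easy={ratio:.1f}")
```

At a focusing parameter of 1 the hard sample is weighted about 9 times more than the easy one, and at 2 about 81 times more; at 5 the easy samples contribute almost nothing, which is in line with the accuracy drop observed for large values in Table 2.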


Table 2. Experimental results with a variety of values of $\alpha$ and $\gamma$ for classifying attacks with
30-epoch training. (a) CSE-CIC-IDS2018 with $\alpha$-balanced CE achieves at most 98.21% accuracy. (b) In
contrast, using FL with the same network with varying $\gamma$/$\alpha$ gives an accuracy of 98.223% at
$\gamma$=1 and $\alpha$=0.5


(a) Varying $\alpha$ for CE loss ($\gamma$=0)

$\alpha$   Accuracy (%)   Sensitivity (%)   Specificity (%)
0.1        98.17          98.17             99.79
0.25       98.14          98.14             99.70
0.5        98.19          98.19             99.79
0.75       98.21          98.21             99.79
1          98.16          98.16             99.77

(b) Varying $\gamma$ for FL (with optimal $\alpha$)

$\gamma$   $\alpha$   Accuracy (%)   Sensitivity (%)   Specificity (%)
0          0.75       98.210         98.210            99.792
0.1        0.1        98.185         98.185            99.788
0.2        0.75       98.152         98.152            99.807
0.5        1          98.220         98.220            99.813
1          0.5        98.223         98.223            99.814
2          0.5        98.119         98.119            99.778
5          0.75       98.062         98.062            99.719


4.3. Performance and comparison 

NIDS configured with focal loss (NIDS-FL) was evaluated by comparing it with
cross-entropy loss (NIDS-CE) and weighted cross-entropy loss (NIDS-WCE). The same
network architecture (number of hidden layers, number of nodes) and hyper-parameter values were used. The
NIDS-CE and NIDS-WCE configurations do not use $\gamma$ and $\alpha$. The weighted cross-entropy used a
balanced mode. It means that this function replicates the smaller class until the number of samples in the
minority and larger classes is equal.
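
One common way to realize such a balanced mode is to give each class a weight inversely proportional to its frequency, w_c = N / (C * n_c), which has the same effect on the summed loss as replicating the smaller classes. Whether the study used exactly this formula is an assumption; the sketch applies it to the training counts from Table 1.

```python
# Training-set record counts copied from Table 1.
train_counts = {
    "Benign": 803_025,
    "Bot": 85_842,
    "BruteForce": 114_387,
    "DDoS": 354_907,
    "DoS": 212_053,
    "Web Attacks": 754,
}


def balanced_weights(counts):
    """Weight each class by N / (C * n_c); this heuristic is an assumption here."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {name: total / (n_classes * n) for name, n in counts.items()}


weights = balanced_weights(train_counts)

# After weighting, every class contributes the same effective number of samples.
effective = {name: weights[name] * n for name, n in train_counts.items()}
```

Under this scheme a web attack sample is weighted roughly a thousand times more than a benign one, which may help explain why the weighted model behaves differently on the minority class early in training.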

In the first stage, training was conducted for 30 epochs. After the training process, an evaluation
was performed using the testing data. Figure 2 shows the metric comparisons for the variants of the loss
function. It shows that for epoch=30, the overall performance of the model using the focal loss function was
better than CE and WCE on almost all metrics, with an accuracy of 98.23%, a precision of 98.34%, a recall
(sensitivity) of 98.23%, and an F1-score of 98.25%, as shown in Figure 2 (a). This research used a multi-class
classification for BoT, Brute Force, DDoS, DoS, and web attacks. The detailed result
for each class may be observed in Table 3. The proposed model excellently detected BoT and DDoS
attacks, with an overall performance above 99.9%. The overall performance for DoS attacks was higher than
90%, while for Brute Force, recall attained 94%, with a precision of only 84%. The file transfer
protocol (FTP)-brute attack has a characteristic that resembles the slow-hypertext transfer protocol (HTTP)
DoS. The two are often misclassified as each other, which reduced the recognition performance for both types
of attacks.


Table 3. Performance of each attack class for epoch=30

Metric          Model      Benign   BoT     Brute Force   DDoS    DoS     Web Attack
Precision (%)   NIDS-CE    99.96    99.96   84.2          99.96   96.08   81.94
                NIDS-WCE   99.94    99.98   83.77         99.82   96.54   97.3
                NIDS-FL    99.97    99.98   83.91         99.98   96.5    96.27
Recall (%)      NIDS-CE    99.97    99.97   93.16         99.98   90.6    67.82
                NIDS-WCE   99.92    99.95   94.01         99.98   90.21   41.38
                NIDS-FL    99.97    99.97   93.97         99.99   90.32   74.14
F1-score (%)    NIDS-CE    99.97    99.96   88.45         99.97   93.26   74.21
                NIDS-WCE   99.93    99.96   88.6          99.9    93.27   58.07
                NIDS-FL    99.97    99.97   88.65         99.98   93.31   83.77


To investigate the performance of the loss function against the imbalanced dataset, this study
examined the model's efficiency in classifying types of attacks, especially in minority classes. The web attacks
are a minority class that amounts to only 0.05% of the training data in Table 1. According to Table 3 and
Figure 2 (b), NIDS-FL outperforms the other methods in classifying web attacks. Its recall and F1-score are
clearly higher than those of the models that use CE and WCE losses, and its precision is far higher than that
of CE. The recall (sensitivity) reaches 74.14%, an increase of about 6.3 percentage points over CE as the
primary loss function. The model that uses WCE does not have good sensitivity in the minority class at
epoch=30, although it has a good precision value.
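
The F1-scores reported for the web attack class follow from the precision and recall values in Table 3 through the harmonic mean. The check below recomputes them; the dictionary layout is illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same unit as the inputs)."""
    return 2 * precision * recall / (precision + recall)


# (precision %, recall %) for the web attack class at epoch=30, from Table 3.
web_attack = {
    "NIDS-CE": (81.94, 67.82),
    "NIDS-WCE": (97.3, 41.38),
    "NIDS-FL": (96.27, 74.14),
}

f1 = {model: round(f1_score(p, r), 2) for model, (p, r) in web_attack.items()}
print(f1)
```

The recomputed values match the F1-score rows of Table 3, and they show why high precision alone does not help NIDS-WCE: its low recall pulls the harmonic mean down.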

Figure 2 (c) shows the loss values of the 3 models; the loss with the FL function is smaller than with CE and
WCE. The number of misclassified attacks was also compared. The error count of NIDS-FL in
Figure 2 (d) is lower than that of the other models, at only 6965. The superiority of FL shows the effect of
the modulating and weight factors of the focal loss. By selecting the most appropriate modulating factor and
weight for the imbalanced class, the loss and misclassification values can be minimized, especially for the
minority class.


[Figure 2: four bar-chart panels, "Overall Performance", "Web Attack Class Performances", "Loss", and
"Error Count", each comparing NIDS-CE, NIDS-WCE, and NIDS-FL]

Figure 2. Graphs of comparison of various testing result metrics for all models (NIDS-CE, NIDS-WCE, and
NIDS-FL) for the process epoch = 30; (a) Overall performance of attacks classification; (b) Performances of
minority class-Web Attack class; (c) Loss value of all models; (d) Error count of all models

In the next phase, the model was evaluated by increasing the number of epochs to 200. As shown by
the loss in Figure 3 (a), the proposed deep learning model using FL with the previously selected
hyper-parameters converges faster than CE and WCE. Figure 3 (b) shows that the network using FL stabilizes
after around 30 epochs, in contrast to 100 epochs and 80 epochs with CE loss and WCE loss, respectively.
However, as the number of epochs increases, the model using the cross-entropy function is more stable,
and its accuracy tends to keep increasing compared to the models that use focal loss and weighted
cross-entropy functions. This improvement is reasonable because the hyper-parameter tuning process was
performed on a model with the cross-entropy loss function and thus produced the most appropriate
hyper-parameters for that model. The resulting focal loss curve tends to fluctuate due to the factor $\gamma$.
Therefore, with an oscillating curve, the model has a slightly higher validation value when the curve reaches
its peak.

Table 4 shows the best results of the 3 models after 200 epochs. The overall
performance results are almost the same in terms of accuracy, precision, recall, F1-score, and specificity. For
instance, the recall values differ by a tiny margin of <0.01%. For the web attack minority class, the
performance after 200 epochs of the model that uses the focal loss function amounted to 97.76%, 75.29%, and
85.07% for the precision, recall, and F1-score, respectively.

Figure 3. NIDS validation accuracy during training with CE loss, WCE loss, and focal loss;
(a) comparison of loss value; (b) comparison of accuracy

Table 4. Comparison of the highest performances of the NIDS models presented in this study (performances in %)

Epochs   Set        Model      Acc. (overall)   Prec.   Recall/Sens.   F1-score   Specif.   Loss      Error count
30       Training   NIDS-CE    98.20            98.29   98.20          98.22      99.80     0.03404   28330
30       Training   NIDS-WCE   98.16            98.28   98.16          98.18      99.79     0.03472   28892
30       Training   NIDS-FL    98.22            98.33   98.22          98.24      99.81     0.00400   27962
30       Testing    NIDS-CE    98.20            98.29   98.20          98.22      99.80     0.03358   7076
30       Testing    NIDS-WCE   98.17            98.28   98.17          98.19      99.78     0.03441   7191
30       Testing    NIDS-FL    98.23            98.34   98.23          98.25      99.81     0.00390   6965
200      Training   NIDS-CE    98.26            98.38   98.26          98.28      99.82     0.03261   27303
200      Training   NIDS-WCE   98.26            98.37   98.26          98.28      99.82     0.03284   27380
200      Training   NIDS-FL    98.27            98.38   98.27          98.29      99.82     0.00140   27218
200      Testing    NIDS-CE    98.27            98.38   98.27          98.29      99.81     0.03282   6814
200      Testing    NIDS-WCE   98.26            98.36   98.26          98.28      99.81     0.03272   6852
200      Testing    NIDS-FL    98.27            98.38   98.27          98.29      99.82     0.00138   6788


Models with a deep auto-encoder pre-training process have advantages in overcoming the imbalance
problem. Increasing the number of layers of the deep auto-encoder network raises the number of complicated
features learned from the original data. Transferring the 4 deep encoding layers of the DAE (7 hidden layers)
improves the classification results of the DNN fine-tuning process, despite the extreme class imbalance. In the
web-attack class, where the training data is only 0.05% of the total data, the sensitivity
reaches 74.14% for the model that uses the focal loss function. For the model that uses cross-entropy, the
results are acceptable, with a sensitivity reaching almost 68%. For the weighted cross-entropy model, the
sensitivity is only 41.38%. The weights and biases in the DNN network are initialized from the values
transferred from the encoding layers of the DAE. The weighted cross-entropy, which uses a weight factor based
on the class frequency, affects the training in a way that is unsuitable for the positive samples at the
beginning of the learning process.

The final loss layer of the deep neural network is the softmax loss function. In DAE optimization, the 
weights are initialized using lecun_uniform. As a result, the output of each layer, including the output 
layer that feeds the softmax function, is uniformly distributed. Consequently, the minority positive samples 
carry little weight in the initial training stage. Therefore, in deep learning using focal loss, the last 
layer's bias term is initialized to some non-zero value [33]. More focus is then directed at the positive 
examples in the early training stage, and the whole training process is likely to be more effective. The 
cross-entropy and weighted cross-entropy models, in contrast, were based on the weight initialization. 
Different loss functions produced different predictions across the models. Within 30 epochs, the model 
using FL achieved slightly better accuracy than the model using CE loss. Another advantage of the FL 
function is that the value of the cost loss is near zero. The value of the cost function represents how well 
the model fits the training examples. 
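
To make the comparison between the loss functions concrete, the following NumPy sketch computes 
cross-entropy and focal loss for one easy and one hard example. It uses the standard focal loss form 
FL(p_t) = -alpha * (1 - p_t)^gamma * log(p_t). The values gamma = 2 and alpha = 0.25 are common defaults 
from the focal loss literature and are assumptions here, not settings reported in this study. The 
`prior_bias` helper shows one widely used non-zero bias initialization for a sigmoid output; how it maps 
to the softmax layer of this model is not specified in the text, so it is included for illustration only. 

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def true_class_prob(probs, labels):
    # Probability the model assigns to the correct class of each sample.
    p_t = probs[np.arange(len(labels)), labels]
    return np.clip(p_t, 1e-12, 1.0)  # avoid log(0)

def cross_entropy(probs, labels):
    return -np.log(true_class_prob(probs, labels))

def focal_loss(probs, labels, gamma=2.0, alpha=0.25):
    # The (1 - p_t)^gamma factor shrinks the loss of well-classified samples.
    p_t = true_class_prob(probs, labels)
    return -alpha * (1.0 - p_t) ** gamma * np.log(p_t)

def prior_bias(pi):
    # Bias b such that sigmoid(b) == pi, for an assumed prior pi of the rare class.
    return -np.log((1.0 - pi) / pi)

logits = np.array([[4.0, 0.0, 0.0],   # true class 0 scored high (easy example)
                   [0.0, 2.0, 0.0]])  # true class 0 scored low (hard example)
labels = np.array([0, 0])

probs = softmax(logits)
ce = cross_entropy(probs, labels)
fl = focal_loss(probs, labels)
```

In this sketch the focal loss of the easy example is reduced far more than that of the hard example, so 
the hard example dominates the gradient. The same down-weighting may also account for part of the much 
smaller loss values reported for the FL model. 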

In general, without modifying the distribution of the CSE-CIC-IDS2018 dataset, the model results 
were satisfactory. These results were better than those of previous studies that used deep learning and 
resampling techniques on the same dataset, as shown in Table 5. The overall accuracy of 98.27% was better 
than that of the deep learning model developed by Lin et al. [26], who achieved an accuracy of 96.2% using 
the SMOTE algorithm for class imbalance and a DNN model based on long short-term memory (LSTM) combined 
with an attention mechanism (AM). Their study also compared various machine learning techniques on the 
over-sampled dataset. None of those methods matched the overall performance of the model proposed in this 
study. For the web attack class, a minority class, they reported a recall of 98%, which is higher than the 
75.29% obtained in this study. However, their precision and F1-score for the web attack class are only 30%. 
In contrast, the precision and F1-score for the web attack class with focal loss in this study are 97.76% 
and 85.07%, respectively, after 200 epochs of training. 
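
The reported web attack metrics can be cross-checked with the standard definition of the F1-score as the 
harmonic mean of precision and recall. The short sketch below uses only the precision (97.76%) and recall 
(75.29%) quoted above; the function name is illustrative. 

```python
def f1_score(precision, recall):
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)

# Web attack class, focal loss model, values as reported in the text.
f1 = f1_score(0.9776, 0.7529)
f1_percent = round(f1 * 100, 2)  # 85.07
```

The result agrees with the reported F1-score of 85.07%, which shows how the lower recall pulls the 
F1-score well below the precision. 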

The precision and F1-score obtained in this study are superior to those of previous research, as 
detailed in Table 5. Hua [29] used various machine and deep learning methods, with under-sampling and 
feature selection techniques for pre-processing. The recall of the LightGBM model was about 0.1 percentage 
points higher than that of the model proposed in this study. However, the precision of the proposed model 
is 0.24 percentage points higher than that of their model. Unfortunately, their research did not explain the 
effect of the model on web attacks as a minority class. Zhao et al. [39], with the semi-supervised 
discriminant auto-encoder (SSDA), and Ferrag et al. [48], with a deep auto-encoder, utilized unsupervised 
learning without modifying the data distribution of the dataset. However, with the hyper-parameter tuning 
performed in this study, the proposed deep learning model resulted in better detection. Deep learning with 
focal loss works better on classes that make up a small proportion of the data. As a result, it solves much 
of the imbalanced-class problem. 

This study has demonstrated the value of deep learning for intrusion detection, using a deep 
auto-encoder as a feature reduction technique together with the focal loss function. This approach provided 
better results on several IDS performance metrics, especially for imbalanced classes. However, the study has 
certain limitations and constraints. In the evaluation process, the infiltration attack class in the initial 
test was eliminated, without otherwise modifying the dataset, because it was often misclassified as the 
benign class. Future studies should therefore develop a deep learning model with a two-stage classification 
that detects infiltration attacks. The focal loss function is also expected to help with the 
imbalanced-class problem in other deep learning algorithms, so future studies should evaluate its influence 
with different deep learning algorithms. Finally, this study used the CSE-CIC-IDS2018 dataset in the 
training and testing processes. Future research should cover an anomaly-based online intrusion detection 
system. 


Table 5. Comparison of the proposed model and previous studies on the CSE-CIC-IDS2018 dataset 

Reference   Imbalanced-class handling  Classifier            Accuracy (%)  Precision (%)  Recall (%)  F1-Score (%)
[26]        SMOTE                      MLP                   90            91             89          -
                                       LSTM                  93            91             93          -
                                       LSTM+AM               96.2          96             96          -
[29]        Under-sampling             LightGBM              98.37         98.14          98.37       98.21
                                       MLP                   97.58         97.23          97.58       97.37
                                       CNN                   98.06         97.60          98.06       97.67
[39]        Original distribution      SSDA                  97.09         -              98.00       -
[48]        Original distribution      CNN                   97.38         -              97.28       -
                                       DA                    97.37         -              98.18       -
Our method  Original distribution      DAE+DNN (focal loss)  98.27         98.38          98.27       98.29

5. CONCLUSION 

This study presented a new deep learning model to address the problem of classifying multi-class 
attacks. The network architecture was partitioned into automatic feature extraction, using a deep 
autoencoder with 7 hidden layers, and a classifier consisting of a fully connected deep neural network. 
The focal loss function was adapted to the proposed model for an imbalanced dataset. With the focal loss, 
the proposed deep learning model converged faster than with cross-entropy loss and weighted cross-entropy 
loss. With web attack classes as minority samples, the evaluation results on CSE-CIC-IDS2018 show that the 
deep learning method with focal loss is a high-quality classifier with 98.38% precision, 98.27% sensitivity, 
and 99.82% specificity. Future studies could build on this research in several respects. First, the 
focal loss function should be evaluated on a variety of imbalanced datasets. Second, the infiltration 
attack class, which behaves in the same way as benign traffic, was eliminated in this research. Future 
studies should develop a deep learning model that uses two stages to filter the infiltration attacks. 